First, some wrangling! Select the columns we need, remove NA values (they aren't useful when looking at multivariate space), and scale the data so that no variable is overweighted in the principal components just because of the units it is measured in.
For example, body mass is in grams with values in the 1000s, while bill length is in mm with values in the 10s.
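To make the units problem concrete, here is a minimal base R sketch. The numbers are made up for illustration (they are not taken from the penguins data), and `toy` is a hypothetical name.

```r
# Toy data (made-up values, NOT the penguins dataset)
toy <- data.frame(
  body_mass_g    = c(3750, 3800, 3250, 4675, 5400),
  bill_length_mm = c(39.1, 39.5, 40.3, 46.1, 50.0)
)

# Raw variances differ by orders of magnitude purely because of the units
sapply(toy, var)

# scale() centers each column to mean 0 and rescales it to sd 1,
# returning a numeric matrix
toy_scaled <- scale(toy)
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```

After scaling, both columns contribute on the same footing, so a PCA is no longer dominated by whichever variable happens to have the largest numbers.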
Notes:
* ends_with() is a helper function that selects all variables ending with a certain string
* drop_na() drops all rows with NA values; name variables inside () to specify which columns to check for NA
* scale() centers each column and scales it to unit standard deviation
* prcomp() runs principal components analysis and returns a list (class prcomp) rather than a data frame
* autoplot() uses ggplot2 to draw a default plot for an object of a particular class in a single command. For example, for a PCA object it will assume you want a PCA biplot
penguins_pca <- penguins %>%
  select(body_mass_g, ends_with("_mm")) %>%
  drop_na() %>%
  scale() %>%
  prcomp()
penguins_pca$rotation # shows the loadings of each of the 4 variables on each principal component
##                          PC1         PC2        PC3        PC4
## body_mass_g        0.5483502 0.084362920 -0.5966001 -0.5798821
## bill_length_mm     0.4552503 0.597031143  0.6443012 -0.1455231
## bill_depth_mm     -0.4003347 0.797766572 -0.4184272  0.1679860
## flipper_length_mm  0.5760133 0.002282201 -0.2320840  0.7837987
# Create a biplot with autoplot(); by default it doesn't show variable loadings or label penguin species!
autoplot(penguins_pca)
# Create a new plot with extra info we can map aesthetics to, like species and other variables
penguin_complete <- penguins %>%
  drop_na(body_mass_g, ends_with("_mm"))
# Includes the PCA data and the observations used to make that PCA
# Observations used to create the PCA and data used for aesthetics MUST align!
autoplot(penguins_pca,
         data = penguin_complete,
         colour = 'species',
         loadings = TRUE,
         loadings.label = TRUE) +
  theme_minimal()
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
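A biplot only shows the first two components, so it can help to check how much variance each component captures. The sketch below uses the built-in USArrests data as a stand-in (an assumption, so it runs without extra packages) and `scale. = TRUE` inside prcomp(), which should be equivalent to piping through scale() first.

```r
# Stand-in data: USArrests ships with base R (used here only as an example,
# it is not the penguins data)
pca <- prcomp(USArrests, scale. = TRUE)  # standardizes columns inside prcomp()

# Proportion of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Loadings: one column per component, one row per original variable
pca$rotation

# Scores: each observation's coordinates in PC space (the points on a biplot)
head(pca$x[, 1:2])
```

summary(pca) reports the same proportions in a formatted table.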
Notes:
* clean_names() by default changes all column headings to lower snake case
* mutate() can also be used to transform existing columns, not just to create new ones
* across() is a helper function for applying a function to several columns at once, e.g. the last 3 columns, or columns that end with…
* tolower() transforms text to lowercase
* str_sub() extracts or replaces substrings (part of a string) from a character vector
fish_noaa <- read_excel(here("data", "foss_landings.xlsx")) %>%
  clean_names() %>%
  mutate(across(where(is.character), tolower)) %>% # For every column that is a character column, apply tolower() to change it to all lowercase
  mutate(nmfs_name = str_sub(nmfs_name, end = -4)) %>% # Overwrites the column, since nmfs_name is the name of an existing column
  filter(confidentiality == "public")
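A small sketch of what tolower() and str_sub(end = -4) do to a name. The example strings are hypothetical: they assume the names carry a 3-character suffix (a space plus two asterisks), which is a guess at why the last three characters are being trimmed.

```r
library(stringr)

# Hypothetical names with a 3-character suffix (assumed, for illustration only)
nm <- c("TUNAS **", "SALMON **")

# Step 1: lowercase everything
nm_lower <- tolower(nm)

# Step 2: a negative `end` counts from the end of the string, so end = -4
# keeps everything up to the 4th-from-last character, dropping the last 3
nm_clean <- str_sub(nm_lower, end = -4)
nm_clean
```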
Notes:
* ggplotly() turns a ggplot object into an interactive graph
* gghighlight() highlights certain series or values
fish_plot <- ggplot(data = fish_noaa, aes(x = year, y = pounds)) +
  geom_line(aes(color = nmfs_name), show.legend = FALSE) +
  theme_minimal()
fish_plot
## Warning: Removed 6 row(s) containing missing values (geom_path).
ggplotly(fish_plot)
# Use gghighlight to highlight certain series
ggplot(data = fish_noaa, aes(x = year, y = pounds, group = nmfs_name)) +
  geom_line() +
  theme_minimal() +
  gghighlight(nmfs_name == "tunas")
## Warning: Tried to calculate with group_by(), but the calculation failed.
## Falling back to ungrouped filter operation...
## label_key: nmfs_name
## Warning: Removed 6 row(s) containing missing values (geom_path).
# Use gghighlight to highlight series by their values
ggplot(data = fish_noaa, aes(x = year, y = pounds, group = nmfs_name)) +
  geom_line(aes(color = nmfs_name)) +
  theme_minimal() +
  gghighlight(max(pounds) > 1e8)
## label_key: nmfs_name
## Warning: Removed 6 row(s) containing missing values (geom_path).
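For line plots, a condition like max(pounds) > 1e8 is tested once per series rather than once per row. A rough dplyr analogue, on made-up numbers (the `toy` tibble is hypothetical, not the NOAA data), shows which series would pass. This is only a sketch of the selection logic, not of how gghighlight is implemented.

```r
library(dplyr)

# Made-up landings (hypothetical values)
toy <- tibble(
  nmfs_name = rep(c("tunas", "salmon", "squids"), each = 3),
  year      = rep(2001:2003, times = 3),
  pounds    = c(2e8, 1.5e8, 9e7,   # tunas:  max 2e8
                5e7, 6e7, 7e7,     # salmon: max 7e7
                9e7, 1.2e8, 8e7)   # squids: max 1.2e8
)

# Keep every row of any series whose maximum exceeds 1e8
highlighted <- toy %>%
  group_by(nmfs_name) %>%
  filter(max(pounds) > 1e8) %>%
  ungroup()

unique(highlighted$nmfs_name)
```

Note that whole series are kept, including years where the value is below the threshold, which is why an entire line gets highlighted.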
lubridate, mutate(), make a graph with months in logical order
Notes:
* read_csv() can read directly from a URL, but weigh the benefits and costs: you won't know exactly what the dataset looked like when you did the analysis. May want to download a hard copy
* mdy() from lubridate converts month-day-year text to a date
* month.abb is a built-in vector in base R; indexing it with a month number, month.abb[], returns the abbreviated month name, and month.name[] returns the full name. Could use case_when() to manually do the same
monroe_wt <- read_csv("https://data.bloomington.in.gov/dataset/2c81cfe3-62c2-46ed-8fcf-83c1880301d1/resource/13c8f7aa-af51-4008-80a9-56415c7c931e/download/mwtpdailyelectricitybclear.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
##   date = col_character(),
##   kWh1 = col_double(),
##   kW1 = col_double(),
##   kWh2 = col_double(),
##   kW2 = col_double(),
##   solar_kWh = col_double(),
##   total_kWh = col_double(),
##   MG = col_double()
## )
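The column specification shows date being read as character, so it has to be parsed before any date functions will work. A minimal sketch of that step, using made-up strings and assuming month/day/year order (which is what mdy() expects):

```r
library(lubridate)

# Made-up date strings in month/day/year order (assumed format)
raw_dates <- c("1/15/2020", "12/03/2019")

parsed <- mdy(raw_dates)  # character -> Date
class(parsed)

# month() pulls the month number out of a Date
month(parsed)
```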
monroe_ts <- monroe_wt %>%
  mutate(date = mdy(date)) %>%
  mutate(record_month = month(date)) %>% # Creates a new column with just the month from the date column
  mutate(month_name = month.abb[record_month]) %>% # Create the month name, not just the month number
  mutate(month_name = fct_reorder(month_name, record_month)) # Reorder the month factor by month number so it won't show up alphabetically in ggplot
ggplot(data = monroe_ts, aes(x = month_name, y = total_kWh)) +
  geom_jitter()
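To see why the reordering step matters, here is a base R sketch on made-up month numbers. It uses factor(levels = month.abb) as a base R alternative to fct_reorder(); that swap is for illustration only and is not what the code above does.

```r
# month.abb is a built-in character vector of length 12
month_num  <- c(3, 1, 12, 1)         # made-up month numbers
month_abbr <- month.abb[month_num]   # look up abbreviations by position
month_abbr

# Without explicit levels, factor() sorts the labels alphabetically,
# which is the order ggplot would use on the axis
levels(factor(month_abbr))

# Base R alternative (an assumption, not the notes' method):
# fix the level order to calendar order using month.abb itself
month_fct <- factor(month_abbr, levels = month.abb)
levels(droplevels(month_fct))
```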